Monitoring and managing jobs
squeue
Shows your jobs currently waiting in the queue or running.
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
... ... ...
This command allows you to obtain information such as the ID of a job, the reserved nodes, the elapsed time, etc.
The ST column shows the state of the job; the most frequent states are R (Running), PD (Pending) and F (Failed).
For pending jobs (PD), the (REASON) column shows why the job is pending; the list of possible reasons is rather long.
The two reasons most frequently encountered are Priority (other jobs have higher priority) and Resources (waiting for resources to become available). If other reasons are shown, it may be useful to check whether the resource request is satisfiable.
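If you need more detail on one particular job, for instance to understand why it is pending, you can query it directly (the job ID below is only a placeholder):
squeue -j 123456 --long
scontrol show job 123456
The second command prints the full description of the job as seen by Slurm, including the requested resources.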
Caution: on Zeus this command shows only your own jobs.
scancel
Allows you to cancel your jobs.
- scancel JOBID cancels job JOBID.
- scancel -n toto cancels all jobs named toto.
- scancel -n toto -t PENDING cancels all jobs named toto that are in the pending state.
- scancel -u user.login cancels all jobs of the user user.login.
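These options can be combined; for instance, to cancel all of your own jobs that are still pending (replace user.login with your login):
scancel -u user.login -t PENDING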
sinfo
Gives information on the current state of the cluster, the available resources and their configuration.
It is possible to format sinfo's output to get more or less detailed information.
For instance,
- sinfo -s gives a summary of the state of the cluster.
- sinfo -N --long gives more detailed node-by-node information.
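The displayed columns can also be chosen explicitly with the -o (or --format) option. As an example, the following prints, for each partition, the number of nodes, the CPUs and memory per node and the node state; the format string is just one possible choice:
sinfo -o "%P %D %c %m %t"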
sacct
Gives information on past jobs. For example,
sacct -S MMDD
returns a list of jobs submitted since a given date where MM corresponds to the month and DD to the day of the current year.
For instance, to get the history of jobs since July 15:
sacct -S 0715
You can also define an end date with the -E option:
sacct -S MMDD -E MMDD
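For instance, to list the jobs submitted between July 15 and July 31:
sacct -S 0715 -E 0731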
Adjust requested memory
Slurm continuously monitors the resources consumed by each job, in terms of the number of cores and the amount of memory. Jobs that consume more resources than requested are killed automatically.
If a job exceeds the requested amount of memory, the following error message may appear in the output file:
slurmstepd: error: Exceeded step memory limit at some point.
While it may be difficult to estimate the required amount of memory accurately, Slurm allows you to query the amount of memory actually used by a job after its execution.
After completion of a job you may use the following command (replacing JOBID by your job identifier):
sacct -o jobid,jobname,reqnodes,reqcpus,reqmem,maxrss,averss,elapsed -j JOBID
The output will be similar to this one:
    ReqMem     MaxRSS     AveRSS    Elapsed
---------- ---------- ---------- ----------
   55000Mn        16?              00:08:33
   55000Mn  17413256K  16269776K   00:08:33
   55000Mn  17440808K  16246408K   00:08:32
where ReqMem is the amount of memory requested with
#SBATCH --mem=55000M
MaxRSS is the maximum amount of memory used on one node and AveRSS is the average amount used per node.
Here, the memory consumption peaked at about 18 GB per node (17440808K ≈ 17.9 GB).
You might consider requesting less memory for similar jobs, for example:
#SBATCH --mem=20G
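As an illustration, the adjusted request simply replaces the previous --mem directive in the job script header; the job name, node count and time limit below are placeholders:
#!/bin/bash
#SBATCH --job-name=toto
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --mem=20G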
To see this info for past jobs since YYYY-MM-DD:
sacct -o jobid,jobname,reqnodes,reqcpus,reqmem,maxrss,averss,elapsed -S YYYY-MM-DD
Watch out: if you get an error message indicating that you have exceeded the memory limit, the reported MaxRSS value is not necessarily larger than ReqMem, because the job is cancelled before Slurm records the peak value.
The cluster contains three types of nodes:
- 24 cores with 128GB or 192GB of memory
- 32 cores with 192GB of memory
- 32 cores with 512GB of memory
Jobs are automatically placed on different types of nodes according to the requested resources.
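For example, since --mem requests memory per node (as in the examples above), a job containing the following directive can only be placed on the 512GB nodes, because the request exceeds the memory of the other node types:
#SBATCH --mem=400G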